Decision Trees
Overview
Decision trees are one of the most intuitive models in machine learning. They partition the feature space into axis-aligned rectangular regions by recursively splitting on individual features, then assign a class (classification) or value (regression) to each region. The learning algorithm is beautifully simple: at each node, exhaustively search every feature and every threshold, pick the split that best purifies the resulting child nodes, then repeat recursively until a stopping criterion is met.
Key Concepts
We focus on classification trees below; regression trees follow the same structure but replace impurity with variance reduction.
Gini Impurity
Gini impurity measures how mixed a node is. For a node containing samples from \(K\) classes:
\[\text{Gini} = 1 - \sum_{i=1}^{K} p_i^2\]
where \(p_i\) is the proportion of samples belonging to class \(i\).
- A perfectly pure node (all one class) has \(\text{Gini} = 0\).
- Maximum disorder (a uniform distribution across classes) gives the highest Gini. For binary classification with a 50/50 split, \(\text{Gini} = 0.5\).
Gini can be interpreted as the probability that two samples drawn independently (with replacement) from the node belong to different classes.
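The formula translates directly into a few lines of Python. This is a minimal sketch; the function name `gini` and the use of label lists are illustrative choices, not tied to any library:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_i^2) over the class proportions p_i."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has Gini 0; a 50/50 binary node has Gini 0.5.
print(gini(["a", "a", "a", "a"]))  # 0.0
print(gini(["a", "a", "b", "b"]))  # 0.5
```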
Entropy (Alternative Criterion)
Entropy is another common impurity measure, used in the ID3 and C4.5 algorithms:
\[H = -\sum_{i=1}^{K} p_i \log_2 p_i\]
A pure node has \(H = 0\); a 50/50 binary node has \(H = 1\) bit. In practice, Gini and entropy produce very similar trees — Gini is slightly faster to compute (no logarithm), which is why scikit-learn uses it as the default.
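Entropy can be sketched the same way (the function name is illustrative); note that classes with zero count simply never appear in the sum, which sidesteps the undefined \(0 \log 0\) term:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits: -sum(p_i * log2(p_i)) over observed classes."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["a", "a", "a", "a"]))  # 0.0 (pure node)
print(entropy(["a", "a", "b", "b"]))  # 1.0 (50/50 binary node: 1 bit)
```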
Information Gain
Information gain (IG) is the reduction in impurity after a split. For a parent node split into children \(\{C_1, C_2, \dots\}\):
\[\text{IG} = \text{Impurity}(\text{parent}) - \sum_{i} \frac{n_i}{n}\,\text{Impurity}(C_i)\]
where \(n\) is the sample count at the parent and \(n_i\) is the count in child \(C_i\). The weighting by \(\frac{n_i}{n}\) ensures that larger children contribute more to the weighted impurity. The split that maximizes IG is chosen.
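The weighted-impurity formula above can be checked with a short sketch. The `gini` helper is repeated here so the block is self-contained; a perfect split recovers the full parent impurity as gain:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(parent, children):
    """IG = Impurity(parent) - sum over children of (n_i / n) * Impurity(C_i)."""
    n = len(parent)
    weighted_child_impurity = sum(len(c) / n * gini(c) for c in children)
    return gini(parent) - weighted_child_impurity

parent = ["a", "a", "b", "b"]
children = [["a", "a"], ["b", "b"]]        # a perfect split: both children pure
print(information_gain(parent, children))  # 0.5
```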
The Split Search Loop (CART Algorithm)
The CART (Classification and Regression Trees) algorithm performs a greedy, exhaustive search at each node:
- For each feature \(j \in \{1, \dots, d\}\):
- Sort the \(n\) samples by feature \(j\)’s values.
- Candidate thresholds are the midpoints between consecutive distinct feature values — so the thresholds come directly from the data, not from an arbitrary grid.
- For each candidate threshold, compute the IG of splitting there.
- Select the feature \(j^*\) and threshold \(t^*\) that produce the highest IG.
- Partition the samples: left child gets \(\{\mathbf{x} \mid x_{j^*} < t^*\}\), right child gets the rest.
- Recurse on each child.
The sorting step dominates: the cost is \(O(n \log n)\) per feature per node. Over the full tree, the total training cost is \(O(d \cdot n^2 \log n)\) in the worst case (every node sees all \(n\) samples), though in practice balanced splits reduce this significantly.
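The search loop above can be sketched as follows. This is a simplified reference implementation, not an optimized one: it recomputes child impurities from scratch at each threshold (an incremental update over the sorted order would bring the per-feature cost down to the \(O(n \log n)\) of the sort):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Greedy exhaustive search over every feature and candidate threshold.

    Returns (feature index, threshold, information gain) of the best split.
    """
    n, d = len(X), len(X[0])
    parent_impurity = gini(y)
    best = (None, None, 0.0)
    for j in range(d):
        # Sort sample indices by feature j's values.
        order = sorted(range(n), key=lambda i: X[i][j])
        values = [X[i][j] for i in order]
        for k in range(1, n):
            if values[k] == values[k - 1]:
                continue  # thresholds only lie between distinct values
            t = (values[k] + values[k - 1]) / 2.0  # midpoint threshold
            left = [y[i] for i in order[:k]]
            right = [y[i] for i in order[k:]]
            ig = (parent_impurity
                  - (len(left) / n) * gini(left)
                  - (len(right) / n) * gini(right))
            if ig > best[2]:
                best = (j, t, ig)
    return best

X = [[1.0], [2.0], [10.0], [11.0]]
y = ["a", "a", "b", "b"]
print(best_split(X, y))  # (0, 6.0, 0.5)
```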
Stopping Criteria
Without constraints, a tree will keep splitting until every leaf is pure — memorizing the training data perfectly. This is a classic case of overfitting. Stopping criteria prevent this:
| Criterion | Effect |
|---|---|
| Max depth | Limits the number of levels in the tree |
| Min samples per leaf | Prevents leaves with too few samples |
| Min samples to split | A node must have at least this many samples to be considered for splitting |
| Min impurity decrease | A split must reduce impurity by at least this amount |
| Pure node | Recursion naturally stops when \(\text{Gini} = 0\) |
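A stopping check combining several of these criteria might look like the sketch below. The parameter names are illustrative (chosen to echo the table), not tied to any particular library:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def should_stop(labels, depth, max_depth=5, min_samples_split=2,
                impurity_tolerance=1e-7):
    """Return True if the recursion should stop at this node."""
    if depth >= max_depth:              # max depth reached
        return True
    if len(labels) < min_samples_split:  # too few samples to split
        return True
    if gini(labels) <= impurity_tolerance:  # node is (effectively) pure
        return True
    return False

print(should_stop(["a", "a", "a"], depth=2))  # True: pure node
print(should_stop(["a", "b", "a"], depth=2))  # False: keep splitting
```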
Prediction
To classify a new sample \(\mathbf{x}\), start at the root and follow the split rules at each internal node — go left if \(x_j < t\), right otherwise — until reaching a leaf. The leaf’s majority class is the prediction. For regression trees, the leaf returns the mean of its training targets.
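Prediction is a simple loop down the tree. The nested-dict node representation below is an assumed one for illustration: internal nodes carry `feature`, `threshold`, `left`, and `right`; leaves carry `prediction`:

```python
def predict(node, x):
    """Route sample x from the root to a leaf and return the leaf's class."""
    while "prediction" not in node:
        if x[node["feature"]] < node["threshold"]:
            node = node["left"]   # x_j < t: go left
        else:
            node = node["right"]  # x_j >= t: go right
    return node["prediction"]

# A depth-1 tree splitting feature 0 at threshold 6.0.
tree = {
    "feature": 0,
    "threshold": 6.0,
    "left": {"prediction": "a"},
    "right": {"prediction": "b"},
}
print(predict(tree, [2.0]))   # a
print(predict(tree, [10.0]))  # b
```

The loop runs once per level, which is where the \(O(\text{depth})\) prediction cost comes from.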
Advantages and Limitations
Advantages:
- Interpretable — the tree can be visualized and explained to non-experts
- No feature scaling required — splits are based on thresholds, not distances
- Handles both numerical and categorical features
- Fast prediction: \(O(\text{depth})\) per sample
Limitations:
- High variance — small changes in data can produce very different trees
- Greedy splitting — each split is locally optimal, not globally
- Axis-aligned boundaries only — diagonal or curved boundaries require many splits to approximate
- Prone to overfitting without regularization
These limitations motivate ensemble methods: bagging (Random Forests) averages many decorrelated trees to reduce variance, while boosting (Gradient Boosted Trees) sequentially corrects errors to reduce bias.
Regression Trees
For regression, the tree predicts a continuous value instead of a class. The impurity measure changes from Gini to variance (or equivalently, mean squared error):
\[\text{Impurity}(\text{node}) = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\]
where \(\bar{y}\) is the mean target value in the node. The split that maximizes the reduction in variance is chosen, and each leaf predicts \(\bar{y}\) for its samples. Everything else — the greedy search, stopping criteria, pruning — works identically.
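For regression, the impurity and gain computations swap in variance, mirroring the Gini-based versions above (function names are illustrative):

```python
def variance(ys):
    """Mean squared deviation from the node mean: the regression impurity."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)

def variance_reduction(parent, left, right):
    """Weighted impurity decrease, the regression analogue of information gain."""
    n = len(parent)
    return (variance(parent)
            - (len(left) / n) * variance(left)
            - (len(right) / n) * variance(right))

parent = [1.0, 1.0, 5.0, 5.0]
print(variance(parent))                                    # 4.0
print(variance_reduction(parent, [1.0, 1.0], [5.0, 5.0]))  # 4.0
```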